

Protecting De-identified Documents from Search-based Linkage Attacks

Lison, Pierre, Anderson, Mark

arXiv.org Artificial Intelligence

While de-identification models can help conceal the identity of the individual(s) mentioned in a document, they fail to address linkage risks, defined as the potential to map the de-identified text back to its source. One straightforward way to perform such linkages is to extract phrases from the de-identified document and then check their presence in the original dataset. This paper presents a method to counter search-based linkage attacks while preserving the semantic integrity of the text. The method proceeds in two steps. We first construct an inverted index of the N-grams occurring in the document collection, making it possible to efficiently determine which N-grams appear in fewer than $k$ documents (either alone or in combination with other N-grams). An LLM-based rewriter is then iteratively queried to reformulate those spans until linkage is no longer possible. Experimental results on a collection of court cases show that the method effectively prevents search-based linkages while remaining faithful to the original content.
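The inverted-index step can be sketched in a few lines. The example below is a minimal illustration (toy documents, whitespace tokenization) and omits the paper's handling of N-gram combinations and the LLM rewriting loop.

```python
from collections import defaultdict

def ngrams(tokens, n):
    """All n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rare_ngrams(docs, n=3, k=2):
    """Return the n-grams occurring in fewer than k documents, using an
    inverted index mapping each n-gram to the set of documents it occurs in."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for g in ngrams(text.split(), n):
            index[g].add(doc_id)
    return {g for g, ids in index.items() if len(ids) < k}

docs = [
    "the court ruled in favour of the plaintiff",
    "the court ruled in favour of the defendant",
]
rare = rare_ngrams(docs, n=3, k=2)
print(rare)  # only the trigrams unique to a single document
```

Any span of the de-identified text matching one of these rare N-grams would single out its source document, so those are the spans handed to the rewriter.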


Two new approaches to multiple canonical correlation analysis for repeated measures data

Górecki, Tomasz, Krzyśko, Mirosław, Gnettner, Felix, Kokoszka, Piotr

arXiv.org Machine Learning

In classical canonical correlation analysis (CCA), the goal is to determine the linear transformations of two random vectors into two new random variables that are most strongly correlated. Canonical variables are pairs of these new random variables, while canonical correlations are correlations between these pairs. In this paper, we propose and study two generalizations of this classical method: (1) Instead of two random vectors we study more complex data structures that appear in important applications. In these structures, there are $L$ features, each described by $p_l$ scalars, $1 \le l \le L$. We observe $n$ such objects over $T$ time points. We derive a suitable analog of the CCA for such data. Our approach relies on embeddings into Reproducing Kernel Hilbert Spaces, and covers several related data structures as well. (2) We develop an analogous approach for multidimensional random processes. In this case, the experimental units are multivariate continuous, square-integrable functions over a given interval. These functions are modeled as elements of a Hilbert space, so in this case, we define the multiple functional canonical correlation analysis, MFCCA. We justify our approaches by their application to two data sets and suitable large sample theory. We derive consistency rates for the related transformation and correlation estimators, and show that it is possible to relax two common assumptions on the compactness of the underlying cross-covariance operators and the independence of the data.
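For reference, the classical CCA that both generalizations build on can be computed via an SVD of the whitened cross-covariance. The sketch below is a minimal NumPy illustration on synthetic data with a shared latent signal; it is not the RKHS-based or functional method proposed in the paper.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Classical CCA: canonical correlations between two data matrices
    (rows are observations), via the whitened cross-covariance SVD."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = len(X)
    Lx = np.linalg.cholesky(X.T @ X / (n - 1))
    Ly = np.linalg.cholesky(Y.T @ Y / (n - 1))
    Cxy = X.T @ Y / (n - 1)
    # Singular values of Lx^{-1} Cxy Ly^{-T} are the canonical correlations.
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    return np.linalg.svd(M, compute_uv=False)

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))  # shared latent signal
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 1))])
Y = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 1))])
rho = canonical_correlations(X, Y)
print(rho)  # first correlation near 1 (shared signal), second near 0
```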


Enhancing Cluster Scheduling in HPC: A Continuous Transfer Learning for Real-Time Optimization

Sliwko, Leszek, Mizera-Pietraszko, Jolanta

arXiv.org Artificial Intelligence

This is the accepted version of the paper published in the 2025 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). This study presents a machine-learning-assisted approach to optimize task scheduling in cluster systems, focusing on node-affinity constraints. Traditional schedulers like Kubernetes struggle with real-time adaptability, whereas the proposed continuous transfer learning model evolves dynamically during operations, minimizing retraining needs. Evaluated on Google Cluster Data, the model achieves over 99% accuracy, reducing computational overhead and improving scheduling latency for constrained tasks. This scalable solution enables real-time optimization, advancing machine learning integration in cluster management and paving the way for future adaptive scheduling strategies. In the rapidly evolving landscape of cloud computing and distributed high-performance environments, efficient management of hardware and software resources has become paramount for ensuring adequate performance and minimizing latency. As organizations increasingly rely on cluster-based architectures to orchestrate a broad range of applications, the importance of effective task scheduling has come to the forefront. Traditional schedulers, such as Kubernetes, have laid the groundwork for managing containerized workloads; however, they struggle to adapt to the dynamic nature of real-time workloads and node-affinity constraints [35]. These limitations result in inefficient resource utilization and longer scheduling delays, which ultimately degrade overall system performance, especially in high-performance systems [9][18].
In mission-critical environments, these issues can escalate, disrupting vital systems like power networks, healthcare, and defense systems.
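To make the node-affinity constraint concrete, here is a hypothetical greedy baseline: filter candidate nodes by the task's required labels, then pick the least-loaded match. All names and fields below are illustrative, not from the paper; the paper's contribution is to replace such static heuristics with a continuously transfer-learned model.

```python
def schedule(task, nodes):
    """Return the name of the least-loaded node satisfying the task's
    node-affinity labels, or None if no node qualifies."""
    required = task.get("affinity", {})
    candidates = [n for n in nodes
                  if all(n["labels"].get(k) == v for k, v in required.items())]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n["load"])["name"]

nodes = [
    {"name": "node-a", "labels": {"gpu": "true"}, "load": 0.7},
    {"name": "node-b", "labels": {"gpu": "true"}, "load": 0.2},
    {"name": "node-c", "labels": {}, "load": 0.1},
]
print(schedule({"affinity": {"gpu": "true"}}, nodes))  # node-b
```

A learned scheduler predicts placements rather than recomputing this filter for every task, which is where the latency gains come from.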


Using ensemble methods of machine learning to predict real estate prices

Pastukh, Oleh, Khomyshyn, Viktor

arXiv.org Artificial Intelligence

In recent years, machine learning (ML) techniques have become a powerful tool for improving the accuracy of predictions and decision-making. Machine learning technologies have begun to penetrate all areas, including the real estate sector. Correct forecasting of real estate value plays an important role in the buyer-seller chain, because it ensures reasonable price expectations based on the offers available in the market and helps both parties of the transaction avoid financial risks. Accurate forecasting is also important for real estate investors to make an informed decision on a specific property. This study helps to gain a deeper understanding of how effective and accurate ensemble machine learning methods are in predicting real estate values. The results obtained in this work are quite accurate, as can be seen from the coefficient of determination (R^2), root mean square error (RMSE) and mean absolute error (MAE) calculated for each model. The Gradient Boosting Regressor model provides the highest accuracy, while the Extra Trees Regressor, Hist Gradient Boosting Regressor and Random Forest Regressor models also give good results. In general, ensemble machine learning techniques can be effectively applied to real estate valuation. This work also outlines directions for future research, namely preprocessing the dataset by detecting and removing anomalous values, as well as the practical implementation of the obtained results.
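As an illustration of this evaluation protocol, the sketch below fits several scikit-learn ensemble regressors on synthetic data and reports R^2, RMSE and MAE. The paper's actual real estate dataset, features and hyperparameters are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a tabular housing dataset.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for model in (GradientBoostingRegressor(random_state=0),
              RandomForestRegressor(random_state=0),
              ExtraTreesRegressor(random_state=0)):
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[type(model).__name__] = (
        r2_score(y_te, pred),
        mean_squared_error(y_te, pred) ** 0.5,  # RMSE
        mean_absolute_error(y_te, pred),
    )

for name, (r2, rmse, mae) in results.items():
    print(f"{name}: R^2={r2:.3f} RMSE={rmse:.1f} MAE={mae:.1f}")
```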


On Revealing the Hidden Problem Structure in Real-World and Theoretical Problems Using Walsh Coefficient Influence

Przewozniczek, M. W., Chicano, F., Tinós, R., Nalepa, J., Ruszczak, B., Wijata, A. M.

arXiv.org Artificial Intelligence

Gray-box optimization employs Walsh decomposition to obtain non-linear variable dependencies and utilize them to propose masks of variables that have a joint non-linear influence on fitness value. These masks significantly improve the effectiveness of variation operators. In some problems, all variables are non-linearly dependent, making the aforementioned masks useless. We analyze the features of the real-world instances of such problems and show that many of their dependencies may have noise-like origins. Such noise-caused dependencies are irrelevant to the optimization process and can be ignored. To identify them, we propose extending the use of Walsh decomposition by measuring variable dependency strength, which allows the construction of the weighted dynamic Variable Interaction Graph (wdVIG). wdVIGs adjust the dependency strength to mixed individuals. They allow the filtering of irrelevant dependencies and re-enable the use of dependency-based masks by variation operators. We verify the wdVIG potential on a large benchmark suite. For problems with noise, the wdVIG masks can improve the optimizer's effectiveness. If all dependencies are relevant for the optimization, i.e., the problem is noise-free, the influence of wdVIG masks is similar to that of state-of-the-art structures of this kind.
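A brute-force Walsh transform makes the notion of joint non-linear influence concrete: a non-zero coefficient on a mask covering several variables signals that those variables jointly affect fitness. The toy example below is exhaustive (exponential in n) and does not implement the paper's wdVIG weighting or dependency filtering.

```python
from itertools import product

def walsh_coefficients(f, n):
    """Exhaustive Walsh transform of a pseudo-Boolean function f on n bits.
    Coefficient for mask S: (1/2^n) * sum_x f(x) * (-1)^popcount(S & x)."""
    coeffs = {}
    for mask in range(2 ** n):
        total = 0.0
        for bits in product((0, 1), repeat=n):
            x = sum(b << i for i, b in enumerate(bits))
            total += f(bits) * (-1) ** bin(mask & x).count("1")
        coeffs[mask] = total / 2 ** n
    return coeffs

# f depends non-linearly on bits 0 and 1 jointly, and linearly on bit 2.
f = lambda b: b[0] * b[1] + b[2]
w = walsh_coefficients(f, 3)
print(w[0b011], w[0b100], w[0b101])  # joint {0,1} term, linear {2} term, zero
```

The non-zero coefficient on mask 0b011 is exactly the kind of evidence gray-box methods use to group variables 0 and 1 into one mask, while mask 0b101 stays empty because bits 0 and 2 do not interact.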


Phoeni6: a Systematic Approach for Evaluating the Energy Consumption of Neural Networks

Oliveira-Filho, Antônio, Silva-de-Souza, Wellington, Sakuyama, Carlos Alberto Valderrama, Xavier-de-Souza, Samuel

arXiv.org Artificial Intelligence

This paper presents Phoeni6, a systematic approach for assessing the energy consumption of neural networks while upholding the principles of fair comparison and reproducibility. The methodology automates energy evaluations through containerized tools, robust database management, and versatile data models. In the first case study, the energy consumption of AlexNet and MobileNet was compared using raw and resized images. Results showed that MobileNet is up to 6.25% more energy-efficient for raw images and 2.32% for resized datasets, while maintaining competitive accuracy levels. In the second study, the impact of image file formats on energy consumption was evaluated. BMP images reduced energy usage by up to 30% compared to PNG, highlighting the influence of file formats on energy efficiency. These findings emphasize the importance of Phoeni6 in optimizing energy consumption for diverse neural network applications and establishing sustainable artificial intelligence practices. Introduction: Deep Neural Networks (DNNs) are being used with relative success in fields such as computer vision and natural language processing [1, 2]. A growing number of initiatives have been promoting the development of these networks to solve everyday problems, including optimizing resource allocation in energy-constrained environments like wireless sensor networks [3]. There are repositories [4, 5] with hundreds of networks, made available in lists ordered by accuracy, which is the primary metric used to assess the quality of each network. Their results emphasize that the search for energy efficiency can significantly benefit mobile devices' autonomy and positively affect the financial costs and carbon footprints of large data centers distributed worldwide. These works measure energy to evaluate their techniques. There is an evident global concern about the energy consumption of software products that affect people's daily lives, and neural networks are becoming one of them.
This fact has important implications for the criteria used to choose these products. It is reasonable to say that energy consumption is becoming part of the criteria for selecting neural networks, just as accuracy is. However, unlike the accuracy calculation, which fundamentally depends on the dataset and the procedures used during the training phase, the energy calculation depends on the devices involved. This aspect adds extra challenges to reproducing results (RR) and making fair comparisons (FC) between different networks [24]. Evaluating the energy consumption of neural networks while adhering to the principles of Fair Comparison (FC) and Result Reproducibility (RR) presents significant challenges.


Modelling of automotive steel fatigue lifetime by machine learning method

Yasniy, Oleh, Tymoshchuk, Dmytro, Didych, Iryna, Zagorodna, Nataliya, Malyshevska, Olha

arXiv.org Artificial Intelligence

In the current study, the fatigue life of QSTE340TM steel was modelled using a machine learning method, namely, a neural network. This problem was solved by a Multi-Layer Perceptron (MLP) neural network with a 3-75-1 architecture, which allows the prediction of the crack length based on the number of load cycles N, the stress ratio R, and the overload ratio Rol. The proposed model showed high accuracy, with mean absolute percentage error (MAPE) ranging from 0.02% to 4.59% for different R and Rol. The neural network effectively reveals the nonlinear relationships between input parameters and fatigue crack growth, providing reliable predictions for different loading conditions.
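The modelling setup can be sketched as follows, assuming a scikit-learn MLP with one hidden layer of 75 units (the 3-75-1 architecture) and the standard MAPE definition. The synthetic crack-length data below is a stand-in for the experimental QSTE340TM measurements, which are not reproduced here.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

rng = np.random.default_rng(0)
N = rng.uniform(1e4, 1e5, 400)       # number of load cycles
R = rng.uniform(0.0, 0.5, 400)       # stress ratio
Rol = rng.uniform(1.0, 2.0, 400)     # overload ratio
a = 0.1 + 1e-5 * N * (1 + R) / Rol   # toy crack length, mm

# 3 inputs -> 75 hidden units -> 1 output, matching the 3-75-1 architecture.
X = StandardScaler().fit_transform(np.column_stack([N, R, Rol]))
model = MLPRegressor(hidden_layer_sizes=(75,), max_iter=2000, random_state=0)
model.fit(X, a)
err = mape(a, model.predict(X))
print(f"training MAPE: {err:.2f}%")
```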


The OPS-SAT benchmark for detecting anomalies in satellite telemetry

Ruszczak, Bogdan, Kotowski, Krzysztof, Evans, David, Nalepa, Jakub

arXiv.org Artificial Intelligence

Detecting anomalous events in satellite telemetry is a critical task in space operations. This task, however, is extremely time-consuming, error-prone and human-dependent, so automated data-driven anomaly detection algorithms have been emerging at a steady pace. However, there are no publicly available datasets of real satellite telemetry accompanied by ground-truth annotations that could be used to train and verify supervised anomaly detection models. In this article, we address this research gap and introduce an AI-ready benchmark dataset (OPSSAT-AD) containing the telemetry data acquired on board OPS-SAT, a CubeSat mission operated by the European Space Agency that came to an end during the night of 22-23 May 2024 (CEST). The dataset is accompanied by baseline results obtained using 30 supervised and unsupervised classic and deep machine learning algorithms for anomaly detection. They were trained and validated using the training-test dataset split introduced in this work, and we present a suggested set of quality metrics that should always be calculated when evaluating new anomaly detection algorithms on OPSSAT-AD. We believe that this work may become an important step toward building a fair, reproducible and objective validation procedure that can be used to quantify the capabilities of emerging anomaly detection techniques in an unbiased and fully transparent way.
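In the spirit of such baselines, here is a minimal unsupervised example: an Isolation Forest flagging injected spikes in a synthetic telemetry channel, scored with precision and recall. The actual OPSSAT-AD data, train-test split and metric suite are defined in the paper, not here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(42)
telemetry = rng.normal(0.0, 1.0, size=(1000, 1))  # nominal channel values
labels = np.zeros(1000, dtype=int)
telemetry[::100] += 8.0                            # inject 10 large spikes
labels[::100] = 1

clf = IsolationForest(contamination=0.01, random_state=0)
pred = (clf.fit_predict(telemetry) == -1).astype(int)  # -1 means "anomaly"

prec = precision_score(labels, pred)
rec = recall_score(labels, pred)
print("precision", round(prec, 2), "recall", round(rec, 2))
```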


European Space Agency Benchmark for Anomaly Detection in Satellite Telemetry

Kotowski, Krzysztof, Haskamp, Christoph, Andrzejewski, Jacek, Ruszczak, Bogdan, Nalepa, Jakub, Lakey, Daniel, Collins, Peter, Kolmas, Aybike, Bartesaghi, Mauro, Martinez-Heras, Jose, De Canio, Gabriele

arXiv.org Artificial Intelligence

Machine learning has vast potential to improve anomaly detection in satellite telemetry, which is a crucial task for spacecraft operations. This potential is currently hampered by a lack of comprehensible benchmarks for multivariate time series anomaly detection, especially for the challenging case of satellite telemetry. The European Space Agency Benchmark for Anomaly Detection in Satellite Telemetry (ESA-ADB) aims to address this challenge and establish a new standard in the domain. It is the result of close cooperation between spacecraft operations engineers from the European Space Agency (ESA) and machine learning experts. The newly introduced ESA Anomalies Dataset contains annotated real-life telemetry from three different ESA missions, two of which are included in ESA-ADB. Results of typical anomaly detection algorithms assessed in our novel hierarchical evaluation pipeline show that new approaches are necessary to address operators' needs. All elements of ESA-ADB are publicly available to ensure its full reproducibility.


Modeling User Preferences via Brain-Computer Interfacing

Leiva, Luis A., Traver, V. Javier, Kawala-Sterniuk, Alexandra, Ruotsalo, Tuukka

arXiv.org Artificial Intelligence

Present Brain-Computer Interfacing (BCI) technology allows inference and detection of cognitive and affective states, but fairly little has been done to study scenarios in which such information can facilitate new applications that rely on modeling human cognition. One state that can be quantified from various physiological signals is attention. Estimates of human attention can be used to reveal preferences and novel dimensions of user experience. Previous approaches have tackled these challenging tasks using a variety of behavioral signals, from dwell time to click-through data, together with computational models of visual correspondence to these behavioral signals. However, behavioral signals are only rough estimates of the real underlying attention and affective preferences of users. Indeed, users may attend to some content simply because it is salient or outrageous, not because it is genuinely interesting. With this paper, we put forward a research agenda and example work using BCI to infer users' preferences, their attentional correlates towards visual content, and their associations with affective experience. Subsequently, we link these to relevant applications, such as information retrieval, personalized steering of generative models, and crowdsourcing population estimates of affective experiences.